Abstract: We present a new algorithm IDS for incremental 
learning of deterministic finite automata (DFA). This algorithm 
is based on the concept of distinguishing sequences introduced 
in [1]. We give a rigorous proof that two versions of this 
learning algorithm correctly learn in the limit. Finally we present 
an empirical performance analysis that compares these two 
algorithms, focussing on learning times and different types of 
learning queries. We conclude that IDS is an efficient algorithm 
for software engineering applications of automata learning, such 
as formal software testing and model inference. 

Index Terms — Online learning, model inference, incremental 
learning, learning in the limit, language inference. 



I. Introduction 

In recent years, automata learning algorithms (a.k.a. regular 
inference algorithms) have found new applications in software 
engineering such as formal verification (e.g. [2], [3], [4]), software 
testing (e.g. [5], [6]) and model inference (e.g. [7]). These 
applications mostly centre around learning an abstraction of a 
complex software system which can then be statically analysed 
(e.g. by model checking) to determine behavioural correctness. 
Many of these applications can be improved by the use of 
learning procedures that are incremental. 

An automata learning algorithm is incremental if: (i) it 
constructs a sequence of hypothesis automata $H_0, H_1, \ldots$ from 
a sequence of observations $o_0, o_1, \ldots$ about an unknown target 
automaton A, and this sequence of hypothesis automata 
finitely converges to A, and (ii) the construction of hypothesis 
$H_i$ can reuse aspects of the construction of the previous 
hypothesis (such as an equivalence relation on states). 

The notion of convergence in the limit, as a model of correct 
incremental learning, originates in [8]. 

Generally speaking, much of the literature on automata 
learning has focussed on offline learning from a fixed pre- 
existing data set describing the target automaton. Other 
approaches, such as [1] and [9], have considered online learning, 
where the data set can be extended by constructing and posing 
new queries. However, little attention has been paid to incremental 
learning algorithms, which can be seen as a subclass of 
online algorithms where serial hypothesis construction using 
a sequence of increasing data sets is emphasized. The much 
smaller collection of known incremental algorithms includes 
the RPNI2 algorithm of [10], the IID algorithm of [11] and 
the algorithm of [12]. However, the motivation for incremental 
learning from a software engineering perspective is strong, and 
can be summarised as follows: 



1) to analyse a large software system it may not be feasible 
(or even necessary) to learn the entire automaton model, 
and 

2) the choice of each relevant observation $o_i$ about a large 
unknown software system often needs to be iteratively 
guided by analysis of the previous hypothesis model 
$H_{i-1}$ for efficiency reasons. 

Our research into efficient learning-based testing (LBT) 
for software systems (see e.g. fl3l . fl4*l . 0) has led us 
to investigate the use of distinguishing sequences to design 
incremental learning algorithms for DFA. Distinguishing se- 
quences offer a rather minimal and flexible way to construct 
a state space partition, and hence a quotient automaton that 
represents a hypothesis H about the target DFA to be learned. 
Distinguishing sequences were first applied to derive the ID 
online learning algorithm for DFA in [1]. 

In this paper, we present a new algorithm, incremental 
distinguishing sequences (IDS), which uses the distinguishing 
sequence technique for incremental learning of DFA. In [6] 
this algorithm has been successfully applied to learning-based 
testing of reactive systems with demonstrated error discovery 
rates up to 4000 times faster than using non-incremental 
learning. Since little seems to have been published about the 
empirical performance of incremental learning algorithms, we 
consider this question too. The structure of the paper is as 
follows. In Section II we review some essential mathematical 
preliminaries, including a presentation of Angluin's original 
ID algorithm, which is necessary to understand the correctness 
proof for IDS. In Section [III] we present two different versions 
of the IDS algorithm and prove their correctness. These are 
called: (1) prefix free IDS, and (2) prefix closed IDS. In Section 
IV we compare the empirical performance of our two IDS 
algorithms with each other. Finally, in Section [V] we present 
some conclusions and discuss future directions for research. 

A. Related Work 

Distinguishing sequences were first applied to derive the ID 
online learning algorithm for DFA in [1]. The ID algorithm 
is not incremental, since only a single hypothesis automaton 
is ever produced. Later an incremental version IID of this 
algorithm was presented in [11]. Like the IID algorithm, our 
IDS algorithm is incremental. However, in contrast with IID, 
the IDS algorithm and its proof of correctness are much 
simpler, and some technical errors in [11] are also overcome. 
Distinguishing sequences can be contrasted with the complete 
consistent table approach to partition construction as represented 
by the well known online learning algorithm L* of [9]. 
Unlike L*, distinguishing sequences dispose of the need for 
an equivalence oracle during learning. Instead, we can assume 
that the observation set P contains a live complete set of input 



strings (see Section II-B below for a technical definition). Fur- 
thermore, unlike L* distinguishing sequences do not require a 
complete table of queries before building the partition relation. 
In the context of software testing, both of these differences 
result in a much more efficient learning algorithm. In particular 
there is greater scope for using online queries that have been 
generated by other means (such as model checking). Moreover, 
since LBT is a black-box approach to software testing, then 
the use of an equivalence oracle contradicts the black-box 
methodology. In [10], an incremental version RPNI2 of the 
RPNI offline learning algorithm of [15] and [16] is presented. 
The RPNI2 algorithm is much more complex than IDS. It 
includes a recursive depth first search of a lexicographically 
ordered state set with backtracking, and computation of a 
non-deterministic hypothesis automaton that is subsequently 
rendered deterministic. These operations have no counterpart 
in IDS. Thus IDS is easier to verify and can be quickly 
and easily implemented in practice. The incremental learning 
algorithm introduced in [12] requires a lexicographic ordering 
on the presentation of online queries, which is less flexible 
than IDS, and indeed inappropriate for software engineering 
applications. 

II. Preliminaries 

A. Notations and concepts for DFA 

Let $\Sigma$ be any set of symbols; then $\Sigma^*$ denotes the set of all 
finite strings over $\Sigma$ including the empty string $\lambda$. The length 
of a string $\alpha \in \Sigma^*$ is denoted by $|\alpha|$ and $|\lambda| = 0$. For strings 
$\alpha, \beta \in \Sigma^*$, $\alpha\beta$ denotes their concatenation. For $\alpha, \beta, \gamma \in \Sigma^*$, 
if $\alpha = \beta\gamma$ then $\beta$ is termed a prefix of $\alpha$ and $\gamma$ is termed a suffix 
of $\alpha$. We let $Pref(\alpha)$ denote the prefix closure of $\alpha$, i.e. the set 
of all prefixes of $\alpha$. We can also apply prefix closure pointwise 
to any set of strings. The set difference operation between 
two sets U and V, denoted by $U - V$, is the set of elements 
of U which are not members of V. The symmetric difference 
operation on pairs of sets is defined by 
$U \oplus V = (U - V) \cup (V - U)$. 

A deterministic finite automaton (DFA) 
is a quintuple $A = (\Sigma, Q, F, q_0, \delta)$, where: $\Sigma$ is the input 
alphabet, Q is the state set, $F \subseteq Q$ is the set of final states, 
$q_0 \in Q$ is the initial state and the state transition function $\delta$ is a 
mapping $\delta : Q \times \Sigma \to Q$, with $\delta(q_i, b) = q_j$ meaning that when in 
state $q_i \in Q$ and given input b the automaton A will move to state 
$q_j \in Q$ in one step. We extend the function $\delta$ to a mapping 
$\delta^* : Q \times \Sigma^* \to Q$ inductively defined by $\delta^*(q, \lambda) = q$ and 
$\delta^*(q, b_1, \ldots, b_n) = \delta(\delta^*(q, b_1, \ldots, b_{n-1}), b_n)$. The language 
L(A) accepted by A is the set of all strings $\alpha \in \Sigma^*$ such 
that $\delta^*(q_0, \alpha) \in F$. As is well known, a language $L \subseteq \Sigma^*$ is 
accepted by a DFA if and only if L is regular, i.e. L can be 
defined by a regular grammar. A state $q \in Q$ is said to be live 
if for some string $\alpha \in \Sigma^*$, $\delta^*(q, \alpha) \in F$; otherwise q is said to 
be dead. Given a distinguished dead state $d_0$ we define string 
concatenation modulo the dead state $d_0$, 
$f : (\Sigma^* \cup \{d_0\}) \times \Sigma \to \Sigma^* \cup \{d_0\}$, 
by $f(d_0, a) = d_0$ and $f(\alpha, a) = \alpha.a$ for $\alpha \in \Sigma^*$. 
This function is used for automaton learning in Section III. 

Given any DFA A there exists a minimum state DFA A' such 
that L(A) = L(A') and this automaton is termed the canonical 
DFA for L(A). A canonical DFA has at most one dead state. 
We represent DFA graphically in the usual way using 
state diagrams. States are represented by small circles labelled 
by state names, and final states among them are marked by 
concentric double circles. The initial state is represented by 
attaching a right arrow to it. The transitions between the 
states are represented by directed arrows from the state of origin 
to the destination state. The symbol read from the origin state 
is attached to the directed arrow as a label. Fig. 1 shows the state 
transition diagram of one such DFA. 
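
The definitions above can be sketched in a few lines of Python. This is only an illustrative sketch under assumptions of ours: strings are Python `str` values, the dead-state name $d_0$ is represented by `None`, and the class name `DFA`, the helper names and the example automaton (which accepts only the string `b`) are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass

DEAD = None  # assumption: None stands for the distinguished dead-state name d_0


@dataclass
class DFA:
    alphabet: set
    states: set
    final: set
    initial: object
    delta: dict  # total map (state, symbol) -> state

    def run(self, q, word):
        """Extended transition function delta*(q, word)."""
        for b in word:
            q = self.delta[(q, b)]
        return q

    def accepts(self, word):
        """word is in L(A) iff delta*(q_0, word) is a final state."""
        return self.run(self.initial, word) in self.final

    def live_states(self):
        """States from which some final state is reachable."""
        live = set(self.final)
        changed = True
        while changed:
            changed = False
            for (q, _), r in self.delta.items():
                if r in live and q not in live:
                    live.add(q)
                    changed = True
        return live


def pref(word):
    """Prefix closure Pref(word), including the empty string."""
    return {word[:i] for i in range(len(word) + 1)}


def f(alpha, b):
    """String concatenation modulo the dead state."""
    return DEAD if alpha is DEAD else alpha + b


# Hypothetical example: a DFA over {a, b} accepting only the string "b".
example = DFA(
    alphabet={"a", "b"},
    states={"q0", "q1", "qd"},
    final={"q1"},
    initial="q0",
    delta={("q0", "a"): "qd", ("q0", "b"): "q1",
           ("q1", "a"): "qd", ("q1", "b"): "qd",
           ("qd", "a"): "qd", ("qd", "b"): "qd"},
)
```

In this example `qd` is the single dead state, so `{"", "b"}` would be a live complete set in the sense defined in the next subsection.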

B. The ID Algorithm 

Our IDS algorithm is an incremental version of the ID 
learning algorithm introduced in [1]. The ID algorithm is an 
online learning algorithm for complete learning of a DFA that 
starts from a given live complete set $P \subseteq \Sigma^*$ of queries about 
the target automaton, and generates new queries until a state 
space partition can be constructed. Since the algorithmic ideas 
and proof of correctness of IDS are based upon those of ID 
itself, it is useful to review the ID algorithm here. Algorithm 
1 presents the ID algorithm. Since this algorithm has been 
discussed at length in [1], our own presentation can be brief. 
A detailed proof of correctness of ID and an analysis of its 
complexity can be found in [1]. 

A finite set $P \subseteq \Sigma^*$ of input 
strings is said to be live complete for a DFA A if for every live 
state $q \in Q$ there exists a string $\alpha \in P$ such that $\delta^*(q_0, \alpha) = q$. 
Given a live complete set P for a target automaton A, the 
essential idea of the ID algorithm is to first construct the set 
$T' = P \cup \{f(\alpha, b) \mid (\alpha, b) \in P \times \Sigma\} \cup \{d_0\}$ of all one element 
extensions of strings in P as a set of state names for the 
hypothesis automaton. The symbol $d_0$ is added as a name 
for the canonical dead state. This set of state names is then 
iteratively partitioned into sets $E_i(\alpha) \subseteq T'$ for $i = 0, 1, \ldots$ 
such that elements $\alpha, \beta$ of $T'$ that denote the same state in 
A will occur in the same partition set, i.e. $E_i(\alpha) = E_i(\beta)$. 
This partition refinement can be proven to terminate and the 
resulting collection of sets forms a congruence on $T'$. Finally 
the ID algorithm constructs the hypothesis automaton as the 
resulting quotient automaton. The method used to refine the 
partition set is to iteratively construct a set V of distinguishing 
strings, such that no two distinct states of A have the same 
behaviour on all of V. 

We will present the ID and IDS algorithms so that similar 
variables share the same names. This pedagogic device 
emphasises similarity in the behaviour of both algorithms. 

Algorithm 1 The ID Learning Algorithm 

Input: A live complete set $P \subseteq \Sigma^*$ and a teacher DFA A to 
answer membership queries $\alpha \in L(A)$? 

Output: A DFA M equivalent to the target DFA A. 

1) begin 
2) // Perform initialization 
3) $i = 0$, $v_i = \lambda$, $V = \{v_i\}$ 
4) $P' = P \cup \{d_0\}$, $T = P \cup \{f(\alpha, b) \mid (\alpha, b) \in P \times \Sigma\}$, $T' = T \cup \{d_0\}$ 
5) Construct function $E_0$ for $v_0 = \lambda$, 
6) $E_0(d_0) = \emptyset$ 
7) $\forall \alpha \in T$ 
8) { pose the membership query "$\alpha \in L(A)$?" 
9)   if the teacher's response is yes 
10)   then $E_0(\alpha) = \{\lambda\}$ 
11)   else $E_0(\alpha) = \emptyset$ 
12)   end if 
13) } 
14) // Refine the partition of the set $T'$ 
15) while ($\exists \alpha, \beta \in P'$ and $b \in \Sigma$ such that $E_i(\alpha) = E_i(\beta)$ but $E_i(f(\alpha, b)) \neq E_i(f(\beta, b))$) 
16) do 
17)   Let $\gamma \in E_i(f(\alpha, b)) \oplus E_i(f(\beta, b))$ 
18)   $v_{i+1} = b\gamma$ 
19)   $V = V \cup \{v_{i+1}\}$, $i = i + 1$ 
20)   $\forall \alpha \in T$ pose the membership query "$\alpha v_i \in L(A)$?" 
21)   { 
22)     if the teacher's response is yes 
23)     then $E_i(\alpha) = E_{i-1}(\alpha) \cup \{v_i\}$ 
24)     else $E_i(\alpha) = E_{i-1}(\alpha)$ 
25)     end if 
26)   } 
27) end while 
28) // Construct the representation M of the target DFA A. 
29) The states of M are the sets $E_i(\alpha)$, where $\alpha \in T$ 
30) The initial state $q_0$ is the set $E_i(\lambda)$ 
31) The accepting states are the sets $E_i(\alpha)$ where $\alpha \in T$ and $\lambda \in E_i(\alpha)$ 
32) The transitions of M are defined as follows: 
33) $\forall \alpha \in P'$ 
34)   if $E_i(\alpha) = \emptyset$ 
35)   then add self loops on the state $E_i(\alpha)$ for all $b \in \Sigma$ 
36)   else $\forall b \in \Sigma$ set the transition $\delta(E_i(\alpha), b) = E_i(f(\alpha, b))$ 
37)   end if 
38) end. 
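
Algorithm 1 can be approximated by the following Python sketch. It is a sketch under assumptions of ours, not the authors' implementation: the teacher is modelled as a function `member(word)`, the dead-state name $d_0$ is `None`, each $E_i(\alpha)$ is a `frozenset` of distinguishing strings, and the non-deterministic choice on line 17 is resolved by taking the shortest available $\gamma$. All function names are hypothetical.

```python
DEAD = None  # assumption: None stands for the dead-state name d_0


def f(alpha, b):
    """String concatenation modulo the dead state."""
    return DEAD if alpha is DEAD else alpha + b


def id_learn(P, alphabet, member):
    """Sketch of ID. P must be a live complete set containing ""."""
    alphabet = sorted(alphabet)
    names = [DEAD] + sorted(P)                        # P' = P u {d_0}
    T = sorted(set(P) | {a + b for a in P for b in alphabet})
    E = {DEAD: frozenset()}                           # E_0(d_0) is empty
    for alpha in T:                                   # lines 7-13
        E[alpha] = frozenset([""]) if member(alpha) else frozenset()

    def next_distinguishing_string():
        """Return a new v_{i+1} = b.gamma, or None if the partition is stable."""
        for alpha in names:
            for beta in names:
                if E[alpha] != E[beta]:
                    continue
                for b in alphabet:
                    diff = E[f(alpha, b)] ^ E[f(beta, b)]
                    if diff:
                        gamma = min(diff, key=lambda w: (len(w), w))
                        return b + gamma
        return None

    # Partition refinement, lines 15-27.
    while (v := next_distinguishing_string()) is not None:
        for alpha in T:
            if member(alpha + v):
                E[alpha] = E[alpha] | {v}

    # Automaton construction, lines 28-37.
    delta = {}
    for alpha in names:
        for b in alphabet:
            delta[(E[alpha], b)] = E[f(alpha, b)] if E[alpha] else E[alpha]
    accepting = {E[alpha] for alpha in T if "" in E[alpha]}
    return E[""], accepting, delta


def accepts(machine, word):
    """Run a machine returned by id_learn on word."""
    q, accepting, delta = machine
    for b in word:
        q = delta[(q, b)]
    return q in accepting
```

For instance, with the hypothetical target "strings with an even number of a's" and the live complete set `{"", "a"}`, the call `id_learn({"", "a"}, "ab", lambda w: w.count("a") % 2 == 0)` returns a two-state machine (plus the name of the dead state) that agrees with the teacher on every string.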




Figure 1. Target Automaton A 




Figure 2. Hypothesis Automaton $M_1$ 



Algorithm 2 The IDS Learning Algorithm 

Input: A file $S = s_1, \ldots, s_l$ of input strings $s_i \in \Sigma^*$ and a teacher DFA A 
to answer membership queries $\alpha \in L(A)$? 

Output: A sequence of DFA $M_t$ for $t = 0, \ldots, l$ as well as the total number 
of membership queries and book-keeping queries asked by the learner. 

1) begin 
2) // Perform initialization 
3) $i = 0$, $k = 0$, $t = 0$, $v_i = \lambda$, $V = \{v_i\}$ 
4) // Process the empty string 
5) $P_0 = \{\lambda\}$, $P'_0 = P_0 \cup \{d_0\}$, $T_0 = P_0 \cup \Sigma$ 
6) $E_0(d_0) = \emptyset$ 
7) $\forall \alpha \in T_0$ { 
8)   pose the membership query "$\alpha \in L(A)$?" 
9)   bquery = bquery + 1 
10)   if the teacher's response is yes 
11)   then $E_0(\alpha) = \{\lambda\}$ 
12)   else $E_0(\alpha) = \emptyset$ 
13) } 
14) // Refine the partition of set $T'_0$ as described in Algorithm 3 
15) // Construct the current representation $M_0$ of the target DFA 
16) // as described in Algorithm 4. 
17) 
18) // Process the file of examples. 
19) while ($S \neq$ empty) 
20) do 
21)   read(S, $\alpha$) 
22)   mquery = mquery + 1 
23)   $k = k + 1$, $t = t + 1$ 
24)   $P_k = P_{k-1} \cup \{\alpha\}$ 
25)   // $P_k = P_{k-1} \cup Pref(\alpha)$ // prefix closure 
26)   $P'_k = P_k \cup \{d_0\}$ 
27)   $T_k = T_{k-1} \cup \{\alpha\} \cup \{f(\alpha, b) \mid b \in \Sigma\}$ 
28)   // $T_k = T_{k-1} \cup Pref(\alpha) \cup \{f(\alpha, b) \mid \alpha \in P_k - P_{k-1}, b \in \Sigma\}$ 
29)   // Line 28 for prefix closure 
30)   $T'_k = T_k \cup \{d_0\}$ 
31)   $\forall \alpha \in T_k - T_{k-1}$ 
32)   { 
33)     // Fill in the values of $E_i(\alpha)$ using membership queries: 
34)     $E_i(\alpha) = \{v_j \mid 0 \leq j \leq i, \alpha v_j \in L(A)\}$ 
35)     bquery = bquery + i 
36)   } 
37)   // Refine the partition of the set $T_k$ 
38)   if $\alpha$ is consistent with $M_{t-1}$ 
39)   then $M_t = M_{t-1}$ 
40)   else construct $M_t$ as described in Algorithm 4. 
41) end while 
42) end. 

Algorithm 3 The Refine Partition Algorithm 

1) while ($\exists \alpha, \beta \in P'_k$ and $b \in \Sigma$ such that $E_i(\alpha) = E_i(\beta)$ but $E_i(f(\alpha, b)) \neq E_i(f(\beta, b))$) 
2) do 
3)   Let $\gamma \in E_i(f(\alpha, b)) \oplus E_i(f(\beta, b))$ 
4)   $v_{i+1} = b\gamma$ 
5)   $V = V \cup \{v_{i+1}\}$, $i = i + 1$ 
6)   $\forall \alpha \in T_k$ pose the membership query "$\alpha v_i \in L(A)$?" 
7)   { 
8)     bquery = bquery + 1 
9)     if the teacher's response is yes 
10)     then $E_i(\alpha) = E_{i-1}(\alpha) \cup \{v_i\}$ 
11)     else $E_i(\alpha) = E_{i-1}(\alpha)$ 
12)     end if 
13)   } 
14) end while 



Algorithm 4 The Automata Construction Algorithm 

1) The states of $M_t$ are the sets $E_i(\alpha)$, where $\alpha \in T_k$. 
2) The initial state $q_0$ is the set $E_i(\lambda)$ 
3) The accepting states are the sets $E_i(\alpha)$ where $\alpha \in T_k$ and $\lambda \in E_i(\alpha)$ 
4) The transitions of $M_t$ are defined as follows: 
5) $\forall \alpha \in P'_k$ 
6)   if $E_i(\alpha) = \emptyset$ 
7)   then add self loops on the state $E_i(\alpha)$ for all $b \in \Sigma$ 
8)   else $\forall b \in \Sigma$ set the transition $\delta(E_i(\alpha), b) = E_i(f(\alpha, b))$ 
9)   end if 
10) $\forall \beta \in T_k - P'_k$ 
11)   if $\forall \alpha \in P'_k$ $E_i(\beta) \neq E_i(\alpha)$ and $E_i(\beta) \neq \emptyset$ 
12)   then $\forall b \in \Sigma$ set the transition $\delta(E_i(\beta), b) = \emptyset$ 
13)   end if 
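
Taken together, Algorithms 2, 3 and 4 (prefix free variant) can be sketched in Python as follows. Several simplifications here are our own assumptions: the teacher is a function `member(word)`, the dead-state name $d_0$ is `None`, the query counters `bquery` and `mquery` are omitted, the choice of $\gamma$ is resolved by taking the shortest candidate, and a hypothesis is rebuilt after every input string instead of applying the consistency test of lines 38-40 of Algorithm 2. The function names are hypothetical.

```python
DEAD = None  # assumption: None stands for the dead-state name d_0


def f(alpha, b):
    """String concatenation modulo the dead state."""
    return DEAD if alpha is DEAD else alpha + b


def ids_learn(S, alphabet, member):
    """Prefix free IDS sketch: returns the hypotheses M_0, ..., M_l."""
    alphabet = sorted(alphabet)
    V = [""]                        # distinguishing strings, v_0 = lambda
    P = [""]                        # P_0 = {lambda}
    T = [""] + alphabet             # T_0 = P_0 u Sigma
    E = {DEAD: frozenset()}
    for alpha in T:
        E[alpha] = frozenset([""]) if member(alpha) else frozenset()

    def find_witness():
        names = [DEAD] + P          # P'_k
        for alpha in names:
            for beta in names:
                if E[alpha] == E[beta]:
                    for b in alphabet:
                        diff = E[f(alpha, b)] ^ E[f(beta, b)]
                        if diff:
                            return b + min(diff, key=lambda w: (len(w), w))
        return None

    def refine():                   # Algorithm 3
        while (v := find_witness()) is not None:
            V.append(v)
            for alpha in T:
                if member(alpha + v):
                    E[alpha] = E[alpha] | {v}

    def construct():                # Algorithm 4
        names = [DEAD] + P
        delta = {}
        for alpha in names:         # lines 5-9
            for b in alphabet:
                delta[(E[alpha], b)] = E[f(alpha, b)] if E[alpha] else E[alpha]
        covered = {E[alpha] for alpha in names}
        for beta in T:              # lines 10-13: unmatched rows go to the dead state
            if E[beta] not in covered:
                for b in alphabet:
                    delta[(E[beta], b)] = frozenset()
        accepting = {E[alpha] for alpha in T if "" in E[alpha]}
        return E[""], accepting, delta

    refine()
    hypotheses = [construct()]
    for s in S:                     # main loop of Algorithm 2, lines 19-41
        if s not in P:
            P.append(s)
        for alpha in [s] + [s + b for b in alphabet]:
            if alpha not in E:      # new element of T_k: fill in E_i(alpha)
                T.append(alpha)
                E[alpha] = frozenset(v for v in V if member(alpha + v))
        refine()
        hypotheses.append(construct())
    return hypotheses


def accepts(machine, word):
    """Run a hypothesis returned by ids_learn on word."""
    q, accepting, delta = machine
    for b in word:
        q = delta[(q, b)]
    return q in accepting
```

With the hypothetical target "even number of a's" and the input sequence `["a"]`, the initial hypothesis accepts only strings without any `a`, and the hypothesis built after reading `a` agrees with the target; this is the kind of convergence that the correctness results of Section III describe.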



However, there are also important differences in behaviour. 
Thus, when analysing the behavioural properties of program 
variables we will carefully distinguish their context as e.g. 
$v_i^{ID}, E_i^{ID}(\alpha), \ldots$, and $v_i^{IDS}, E_i^{IDS}(\alpha), \ldots$ etc. Our proof of 
correctness for IDS will show how the learning behaviour of 
IDS on a sequence of input strings $s_1, \ldots, s_n \in \Sigma^*$ can be 
simulated by the behaviour of ID on the corresponding set of 
inputs $\{s_1, \ldots, s_n\}$. Once this is established, one can apply the 
known correctness of ID to establish the correctness of IDS. 



Figure 4. Average Time Complexity 




Figure 5. Average Membership/Book-keeping Queries (plotted series: MMQ (Prefix), MBQ (Prefix), MMQ (Prefix Free), MBQ (Prefix Free)) 



C. Behavioural differences between IID and IDS 

The IID algorithm of [11] also presents a simulation method 
for ID. However, the following points of difference in the behaviour 
of IID and IDS are worth mentioning: 

1) IID starts from a null DFA as hypothesis while IDS 
constructs the initial hypothesis after reading all $a \in \Sigma$ 
from the initial state. 

2) IID discards all negative examples and waits for the 
first positive example after the construction of the initial 
(null) hypothesis to do further construction. IDS on the 
other hand does construction with all negative and positive 
examples after building an initial hypothesis, which 
makes it more useful for the practical software engineering 
applications identified in Section I. 

3) IID in some cases builds hypotheses which have a 
partially defined transition function $\delta$ rather than a 
total one. This will be shown with an example in the 
next section. IDS fixes this problem due to lines 10-13 
of Algorithm 4 described in this paper. 

4) Unlike IDS, there is no prefix free version of IID. 

5) In addition to the above it is easily shown that IID does 
not satisfy our Simulation Theorem 2 and thus the two 
algorithms are quite different. 

The behavioural properties of ID that are needed to complete this correctness 
proof can be stated as follows. 
Theorem 1: (i) Let $P \subseteq \Sigma^*$ be a live complete set for 
a DFA A containing $\lambda$. Then given P and A as input, the 
ID algorithm terminates and the automaton M returned is the 
canonical automaton for L(A). (ii) Let $l \in N$ be the maximum 
value of program variable $i^{ID}$ given P and A. For all $0 \leq n \leq l$ 
and for all $\alpha \in T$, 

$E_n^{ID}(\alpha) = \{ v_j^{ID} \mid 0 \leq j \leq n, \alpha v_j^{ID} \in L(A) \}$. 

Proof: (i) See [1] Theorem 3. (ii) By induction on n. ■ 

One difference between ID and IID is the frequency of hypothesis 
automaton construction. With ID this occurs just once, 
after a single partition refinement is completed. However, with 
IID this occurs regularly, after each partition refinement. This 
difference means that the automaton construction algorithm 
(Algorithm 1 lines 28-37) used for ID can no longer be used 
for IID (as asserted in [11]), as we show below. Suppose 
we want to learn the automaton A shown in Fig. 1. A 
positive example for this automaton is (b, +). If we use it 
to start learning A using the IID algorithm of [11], we have 
$P_0 = Pref(b) = \{b, \lambda\}$ and $P'_0 = \{b, \lambda\} \cup \{d_0\} = \{b, \lambda, d_0\}$. 
We then obtain the sets $T_0 = P_0 \cup \{f(\alpha, b) \mid (\alpha, b) \in 
P_0 \times \Sigma\} = \{b, \lambda\} \cup \{b, \lambda\} \times \{a, b\} = \{\lambda, a, b, ba, bb\}$ and 
$T'_0 = T_0 \cup \{d_0\} = \{d_0, \lambda, a, b, ba, bb\}$. The initial column for 
the table of partition sets is constructed as shown in Table I. 

Here $i = 0$ and $v_0 = \lambda$. From this column we see that 
two elements of the set $P'_0$ have the same value, i.e. $E(d_0) = 
E(\lambda)$ but $E(f(d_0, b)) \neq E(f(\lambda, b))$. Therefore we have $\gamma \in 
E(f(d_0, b)) \oplus E(f(\lambda, b))$ or $\gamma \in \emptyset \oplus \{\lambda\}$. We can choose $\gamma = \lambda$ 
which gives the distinguishing string $v_1 = b\gamma = b\lambda = b$. We 
then extend Table I for $i = 1$. Now we can see that all elements 
of the set $P'_0$, which are $b$, $\lambda$ and $d_0$, have distinct values in 
the last column of the table and so no further refinement of 
the partition is possible. At this stage IID constructs the next 
hypothesis automaton $M_1$ as shown in Figure 2. 

Now we observe that the transition function $\delta$ for this 
automaton is only partially defined, since there are no outgoing 
transitions for the state named $\{\lambda\}$. Therefore the IID algorithm 
of [11] does not always generate a hypothesis automaton 
with a well defined transition function $\delta$. This created problems 
when we tried to use this algorithm for the practical software 
engineering applications identified in Section I; e.g. a model 
checker used to verify this kind of hypothesis can get 
stuck in such a state with no outgoing transitions. Similarly, 
an automata equivalence checker which is used to terminate 
a learning-based testing process goes into an infinite loop 
because of such a state, since it will never find its equivalent 
state in the target automaton. 

The problem seems to stem from an unclosed table in 
some executions of IID. The notion of a closed and consistent 
observation table is given in [9] for the L* algorithm. The L* 
algorithm also incrementally builds the observation table but 
keeps asking queries until it becomes closed and consistent. In 
this case the table will be closed when $\forall \beta \in T \setminus P'$ $\exists \alpha \in P'$ 
such that $E_i(\alpha) = E_i(\beta)$. If $E_i(\alpha) \neq E_i(\beta)$ as in the 
above example, where $E_1(bb) = \{\lambda\}$ is not equal to any 
of $E_1(d_0) = \emptyset$, $E_1(\lambda) = \{b\}$ or $E_1(b) = \{\lambda, b\}$, then the 
solution in [9] is to move $\beta \in T \setminus P'$ to the set $P'$ and ask 
more queries to rebuild the congruence that might have been 
affected by this last addition to the set $P'$. However L* is a 
complete learning algorithm and it only outputs the description 
of the hypothesis automaton after learning it completely. 
Therefore for incremental learning the simplest fix is to set 
the transitions of such states to the dead state, as done in lines 
10-13 of Algorithm 4. This doesn't require any new entries or 
shuffling the previous entries up in the table, thus keeping the 
congruence intact. 
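
The closedness test and the dead-state fix can be illustrated on the example above with a small Python sketch. The sketch rests on assumptions of ours: `None` stands for $d_0$, each row $E_1(\cdot)$ is a `frozenset`, and only the four rows whose values are stated in the text are included (the rows for `a` and `ba` are left out because their values are not given here). The function names are hypothetical.

```python
DEAD = None  # assumption: None stands for the dead-state name d_0


def unclosed_rows(E, P_prime, T):
    """Names beta in T - P' whose row equals no row E[alpha] with alpha in P'."""
    covered = {E[alpha] for alpha in P_prime}
    return [beta for beta in T
            if beta not in P_prime and E[beta] not in covered]


def redirect_to_dead(E, rows, alphabet, delta):
    """Dead-state fix: send every transition of an unmatched row to E[d_0]."""
    for beta in rows:
        for b in alphabet:
            delta[(E[beta], b)] = E[DEAD]
    return delta


# Rows E_1 quoted in the text; "" plays the role of lambda.
E1 = {
    DEAD: frozenset(),
    "": frozenset({"b"}),
    "b": frozenset({"", "b"}),
    "bb": frozenset({""}),
}
rows = unclosed_rows(E1, [DEAD, "", "b"], ["", "b", "bb"])
delta = redirect_to_dead(E1, rows, "ab", {})
```

Here `rows` contains only `bb`, the name of the state $\{\lambda\}$ that has no outgoing transitions in Figure 2, and `delta` gives that state a transition to the dead state for each input symbol.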

III. Correctness of IDS Algorithm 

In this section we present our IDS incremental learning 
algorithm for DFA. In fact, we consider two versions of this 
algorithm, with and without prefix closure of the set of input 
strings. We then give a rigorous proof that both algorithms 
correctly learn an unknown DFA in the limit in the sense 
of ID. In Algorithm 2 we present the main IDS algorithm, 
and in Algorithms 3 and 4 we give its auxiliary algorithms 
for iterative partition refinement and automaton construction 
respectively. The version of the IDS algorithm which appears 
in Algorithm 2 we term the prefix free IDS algorithm, due to 
lines 24 and 27. Notice that lines 25 and 28 of Algorithm 2 
have been commented out. When these latter two lines are 
uncommented and instead lines 24 and 27 are commented 
out, we obtain a version of the IDS algorithm that we term 
prefix closed IDS. We will prove that both prefix closed and 
prefix free IDS learn correctly in the limit. However, in Section 
IV we will show that they have quite different performance 
characteristics in a way that can be expected to influence 
applications. 

We will prove the correctness of the prefix free IDS al- 
gorithm first, since this proof is somewhat simpler, while 
the essential proof principles can also be applied to verify 
the prefix closed IDS algorithm. We begin an analysis of 
the correctness of prefix free IDS by confirming that the 
construction of hypothesis automata carried out by Algorithm 
4 is well defined. 

Proposition 1: For each $t \geq 0$ the hypothesis automaton $M_t$ 
constructed by the automaton construction Algorithm 4 after t 
input strings have been observed is a well defined DFA. 

Proof: The main task is to show $\delta$ to be a well defined 
function, uniquely defined for every state $E_i(\alpha)$, where 
$\alpha \in T_k$. ■ 

Proposition 1 establishes that Algorithm 2 will generate a 
sequence of well defined DFA. However, to show that this 
algorithm learns correctly, we must prove that this sequence of 
automata converges to the target automaton A given sufficient 
information about A. It will suffice to show that the behaviour 
of prefix free IDS can be simulated by the behaviour of ID, 
since ID is known to learn correctly given a live complete set 
of input strings (c.f. Theorem 1(i)). The first step in this proof 
is to show that the sequences of sets of state names ${P'}_k^{IDS}$ and 
$T_k^{IDS}$ generated by prefix free IDS converge to the sets ${P'}^{ID}$ 
and $T^{ID}$ of ID. 

Proposition 2: Let $S = s_1, \ldots, s_l$ be any non-empty sequence 
of input strings $s_i \in \Sigma^*$ for prefix free IDS and let 
$P^{ID} = \{\lambda, s_1, \ldots, s_l\}$ be the corresponding input set for ID. 
(i) For all $0 \leq k \leq l$, $P_k^{IDS} = \{\lambda, s_1, \ldots, s_k\} \subseteq P^{ID}$. 
(ii) For all $0 \leq k \leq l$, $T_k^{IDS} = P_k^{IDS} \cup \{f(\alpha, b) \mid \alpha \in P_k^{IDS}, b \in \Sigma\} \subseteq T^{ID}$. 
(iii) $P_l^{IDS} = P^{ID}$ and $T_l^{IDS} = T^{ID}$. 

Proof: Clearly (iii) follows from (i) and (ii). We prove (i) 
and (ii) by induction on k. ■ 

Next we turn our attention to proving some fundamental 
loop invariants for Algorithm 2. Since this algorithm in turn 
calls the partition refinement Algorithm 3, we have in 
effect a doubly nested loop structure to analyse. Clearly the 
two indexing counters $k^{IDS}$ and $i^{IDS}$ (in the outer and inner 
loops respectively) both increase on each iteration. However, 
the relationship between these two variables is not easily 
defined. Nevertheless, since both variables increase from an 
initial value of zero, we can assume the existence of a 
monotone re-indexing function that captures their relationship. 

Definition 1: Let $S = s_1, \ldots, s_l$ be any non-empty sequence 
of strings $s_i \in \Sigma^*$. The re-indexing function $K^S : 
N \to N$ for prefix free IDS on input S is the unique 
monotonically increasing function such that for each $n \in N$, 
$K^S(n)$ is the least integer m such that program variable $k^{IDS}$ 
has value m while the program variable $i^{IDS}$ has value n. 
Thus, for example, $K^S(0) = 0$. When S is clear from the 
context, we may write $K$ for $K^S$. 
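
As a small illustration of Definition 1, the following Python sketch computes the re-indexing function from a recorded trace of the two counters. The trace format (a list of `(k, i)` pairs sampled during one execution) and the example values are hypothetical and serve only to show the definition at work.

```python
def reindex(trace):
    """Map each value n of i to the least value m of k observed together with it."""
    K = {}
    for k, i in trace:
        K[i] = min(k, K.get(i, k))
    return K


# Hypothetical trace of (k, i) pairs from one run of prefix free IDS.
trace = [(0, 0), (1, 0), (1, 1), (2, 1), (2, 2), (2, 3), (3, 3)]
K = reindex(trace)
```

For this trace, `K` maps 0 to 0, 1 to 1, and both 2 and 3 to 2, so it is monotone (non-decreasing) as the definition requires.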

With the help of such re-indexing functions we can express 
important invariant properties of the key program variables 
$v_j^{IDS}$ and $E_n^{IDS}(\alpha)$, and via Proposition 2 their relationship 
to $v_j^{ID}$ and $E_n^{ID}(\alpha)$. Corresponding to the doubly nested loop 
structure of Algorithm 2, the proof of Theorem 2 below makes 
use of a doubly nested induction argument. 

Theorem 2: (Simulation Theorem) Let $S = s_1, \ldots, s_l$ be 
any non-empty sequence of strings $s_i \in \Sigma^*$. For any execution 
of prefix free IDS on S there exists an execution of ID on 
$\{\lambda, s_1, \ldots, s_l\}$ such that for all $m \geq 0$: 

(i) For all $n \geq 0$ if $K(n) = m$ then: (a) for all $0 \leq j \leq n$, 
$v_j^{IDS} = v_j^{ID}$, (b) for all $0 \leq j < n$, $v_n^{IDS} \neq v_j^{IDS}$, (c) for 
all $\alpha \in T_m^{IDS}$, $E_n^{IDS}(\alpha) = \{v_j^{IDS} \mid 0 \leq j \leq n, \alpha v_j^{IDS} \in L(A)\}$. 

(ii) If $m > 0$ then let $p \in N$ be the greatest integer 
such that $K(p) = m - 1$. Then for all $\alpha \in T_m^{IDS}$, $E_p^{IDS}(\alpha) = 
\{v_j^{IDS} \mid 0 \leq j \leq p, \alpha v_j^{IDS} \in L(A)\}$. 

(iii) The mth partition refinement of IDS terminates. 

Proof: By induction on m using Proposition 2(i). ■ 

Notice that in the statement of Theorem 2 above, since both 
ID and IDS are non-deterministic algorithms (due to the 
non-deterministic choice on line 17 of Algorithm 1 and line 3 of 
Algorithm 3), we can only talk about the existence of 
some correct simulation. Clearly there are also simulations of 
IDS by ID which are not correct, but this does not affect the 
basic correctness argument. 

Corollary 1: Let $S = s_1, \ldots, s_l$ be any non-empty sequence 
of strings $s_i \in \Sigma^*$. Any execution of prefix free IDS on S 
terminates with the program variable $k^{IDS}$ having value $l$. 

Proof: Follows from Simulation Theorem 2(iii) since 
clearly the while loop of Algorithm 2 terminates when the 
input sequence S is empty. ■ 

Using the detailed analysis of the invariant properties of the 
program variables $P_k^{IDS}$ and $T_k^{IDS}$ in Proposition 2 and $v_j^{IDS}$ 
and $E_n^{IDS}(\alpha)$ in Simulation Theorem 2, it is now a simple 
matter to establish correctness of learning for the prefix free 
IDS Algorithm. 

Theorem 3: (Correctness Theorem) Let $S = s_1, \ldots, s_l$ 
be any non-empty sequence of strings $s_i \in \Sigma^*$ such that 
$\{\lambda, s_1, \ldots, s_l\}$ is a live complete set for a DFA A. Then prefix 
free IDS terminates on S and the hypothesis automaton $M_l^{IDS}$ 
is a canonical representation of A. 

Proof: By Corollary 1, prefix free IDS terminates on 
S with the variable $k^{IDS}$ having value $l$. By Simulation 
Theorem 2(i) and Theorem 1(ii), there exists an execution 
of ID on $\{\lambda, s_1, \ldots, s_l\}$ such that $E_n^{IDS}(\alpha) = E_n^{ID}(\alpha)$ for 
all $\alpha \in T_l^{IDS}$ and any n such that $K(n) = l$. By Proposition 
2(iii), $T_l^{IDS} = T^{ID}$ and ${P'}_l^{IDS} = {P'}^{ID}$. So letting $M^{ID}$ 
be the canonical representation of A constructed by ID using 
$\{\lambda, s_1, \ldots, s_l\}$, then $M^{ID}$ and $M_l^{IDS}$ have the same state sets, 
initial states, accepting states and transitions. ■ 

Our next result confirms that the hypothesis automaton $M_t^{IDS}$ 
generated after t input strings have been read is consistent with 
all currently known observations about the target automaton. 
This is quite straightforward in the light of Simulation 
Theorem 2. 

Theorem 4: (Compatibility Theorem) Let $S = s_1, \ldots, s_l$ 
be any non-empty sequence of strings $s_i \in \Sigma^*$. For each $0 \leq 
t \leq l$, $M_t^{IDS}$ is compatible with A on $\{\lambda, s_1, \ldots, s_t\}$. 

Proof: By definition, $M_t^{IDS}$ is compatible with A on 
$\{\lambda, s_1, \ldots, s_t\}$ if, and only if, for each $0 \leq j \leq t$, $s_j \in 
L(A) \Leftrightarrow \lambda \in E_{i_t}^{IDS}(s_j)$, where $i_t$ is the greatest integer 
such that $K(i_t) = t$ and the sets $E_{i_t}^{IDS}(\alpha)$ for $\alpha \in T_t^{IDS}$ 
are the states of $M_t^{IDS}$. Now $v_0^{IDS} = \lambda$. So by Simulation 
Theorem 2(i)(c), if $s_j \in L(A)$ then $s_j v_0^{IDS} \in L(A)$ so 
$v_0^{IDS} \in E_{i_t}^{IDS}(s_j)$, i.e. $\lambda \in E_{i_t}^{IDS}(s_j)$, and if $s_j \notin L(A)$ then 
$s_j v_0^{IDS} \notin L(A)$ so $v_0^{IDS} \notin E_{i_t}^{IDS}(s_j)$, i.e. $\lambda \notin E_{i_t}^{IDS}(s_j)$. ■ 
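
The compatibility property can also be checked mechanically. The following Python sketch is an assumption-laden illustration rather than part of the algorithm: `hyp_accepts` and `member` are hypothetical functions deciding acceptance by the hypothesis and by the target respectively, and the two example languages are invented for the demonstration.

```python
def compatible(hyp_accepts, member, samples):
    """True iff hypothesis and target agree on lambda and on every sample."""
    return all(hyp_accepts(s) == member(s) for s in [""] + list(samples))


# Hypothetical target: even number of a's. Hypothetical hypothesis: no a at all.
target = lambda w: w.count("a") % 2 == 0
hypothesis = lambda w: "a" not in w
```

Here `compatible(hypothesis, target, ["b", "a"])` holds, whereas adding the sample `aa` exposes a disagreement; Theorem 4 states that such a disagreement can never arise between $M_t^{IDS}$ and the strings read so far.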

Let us briefly consider the correctness of prefix closed IDS. 
We begin by observing that the non-sequential ID Algorithm 
1 does not compute any prefix closure of input strings. 
Therefore, Proposition [2] does not hold for prefix closed IDS. 
In order to obtain a simulation between prefix closed IDS and 
ID we modify Proposition [2] to the following. 

Proposition 3: Let $S = s_1, \ldots, s_l$ be any non-empty sequence 
of input strings $s_i \in \Sigma^*$ for prefix closed IDS and 
let $P^{ID} = Pref(\{\lambda, s_1, \ldots, s_l\})$ be the corresponding input 
set for ID. 

(i) For all $0 \leq k \leq l$, $P_k^{IDS} = Pref(\{\lambda, s_1, \ldots, s_k\}) \subseteq P^{ID}$. 

(ii) For all $0 \leq k \leq l$, $T_k^{IDS} = P_k^{IDS} \cup \{f(\alpha, b) \mid \alpha \in P_k^{IDS}, b \in 
\Sigma\} \subseteq T^{ID}$. 

(iii) $P_l^{IDS} = P^{ID}$ and $T_l^{IDS} = T^{ID}$. 

Proof: Similar to the proof of Proposition 2. ■ 

Theorem 5: (Correctness Theorem) Let $S = s_1, \ldots, s_l$ 
be any non-empty sequence of strings $s_i \in \Sigma^*$ such that 
$\{\lambda, s_1, \ldots, s_l\}$ is a live complete set for a DFA A. Then prefix 
closed IDS terminates on S and the hypothesis automaton $M_l^{IDS}$ 
is a canonical representation of A. 

Proof: Exercise, following the proof of Theorem 3. ■ 

IV. Empirical Performance Analysis 

Little seems to have been published about the empirical per- 
formance and average time complexity of incremental learning 
algorithms for DFA in the literature. By the average time 
complexity of the algorithm we mean the average number of 
queries needed to completely learn a DFA of a given state 
space size. This question can be answered experimentally by 
randomly generating a large number of DFA with a given 
state space size, and randomly generating a sequence of query 
strings for each such DFA. From the point of view of software 
engineering applications such as testing and model inference, 
we have found that it is important to distinguish between 
the two types of queries about the target automaton that 
are used by IDS during the learning procedure. On the one 
hand, the algorithm uses internally generated queries (we call 
these book-keeping queries) and on the other hand it uses 
queries that are supplied externally by the input file (we 
call these membership queries). From a software engineering 
applications viewpoint it seems important that the ratio of 
book-keeping to membership queries should be low. This 
allows membership queries to have the maximum influence 
in steering the learning process externally. The average query 
complexity of the IDS algorithm with respect to the numbers 
of book-keeping and membership queries needed for complete 
learning can also be measured by random generation of DFA 
and query strings. To measure each query type, Algorithm [2]
has been instrumented with two integer variables bquery and
mquery intended to track the total number of each type of
query used during learning (lines 9, 22 and 35). Since two
variants of the IDS algorithm were identified, with and without 
prefix closure of input strings, it was interesting to compare 
the performance of each of these two variants according to the 
above two average complexity measures. 
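
The instrumentation described above can be pictured with the following minimal sketch. It is not the instrumented Algorithm 2 itself: the class CountingOracle and its methods are hypothetical, and the target automaton is abstracted as a predicate that answers whether a string is accepted. Only the counter names bquery and mquery come from the text.

```java
import java.util.function.Predicate;

/** Hypothetical wrapper that counts the two query types separately. */
final class CountingOracle {
    private final Predicate<String> target; // true iff the target accepts the string
    private int bquery = 0; // internally generated (book-keeping) queries
    private int mquery = 0; // externally supplied (membership) queries

    CountingOracle(Predicate<String> target) {
        this.target = target;
    }

    /** Query generated internally by the learner. */
    boolean bookKeepingQuery(String s) {
        bquery++;
        return target.test(s);
    }

    /** Query read from the external input sequence. */
    boolean membershipQuery(String s) {
        mquery++;
        return target.test(s);
    }

    int bquery() { return bquery; }

    int mquery() { return mquery; }

    /** Ratio of book-keeping to membership queries; NaN before the first membership query. */
    double ratio() {
        return mquery == 0 ? Double.NaN : (double) bquery / mquery;
    }

    public static void main(String[] args) {
        // Toy target: strings of even length are accepted.
        CountingOracle oracle = new CountingOracle(s -> s.length() % 2 == 0);
        oracle.membershipQuery("ab");
        oracle.bookKeepingQuery("a");
        oracle.bookKeepingQuery("abb");
        System.out.println(oracle.bquery() + " book-keeping, " + oracle.mquery() + " membership");
    }
}
```

Keeping the counters in a wrapper around the target, rather than inside the learner, is only one possible design; it makes the ratio of the two query types easy to read off after each run.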

To empirically measure the average time and query com- 
plexity of our two IDS algorithms, two experiments were set 
up. These measured: 

(1) the average computation time needed to learn a randomly 
generated DFA (of a given state space size) using randomly 
generated membership queries, and 

(2) the total number of membership and book-keeping
queries needed to learn a randomly generated DFA (of a
given state space size) using randomly generated membership
queries.

We chose randomly generated DFA with state space
sizes varying between 5 and 50 states, and an equiprobable
distribution of transitions between states. No filtering was
applied to remove dead states, so the average effective state
space size was somewhat smaller than the nominal
state space size.

The experimental setup consisted of the following compo- 
nents: 

(1) a random input string generator, 

(2) a random DFA generator, 

(3) an instance of the IDS Algorithm (prefix free or prefix
closed),

(4) an automaton equivalence checker. 

The architecture of our evaluation framework and the flow
of data between these components are illustrated in Figure [3].

The random input string generator constructed strings over the
alphabet $\Sigma$ of the target automaton, and the length of the
generated strings was always $< |Q|$ of the target automaton.
Since IDS begins learning by reading the null string, the
null string was supplied only externally and was never generated
randomly, to avoid unnecessary repetition. The random DFA
generator started by building a state set $Q$ of a specified size.
The number of final states $|F| < |Q|$ was chosen randomly
and then these final states were marked at random in
the state set $Q$. The initial state was also chosen randomly
from the state set. Similarly, the transition function $\delta$ was
constructed by randomly assigning a next state from $Q$
to each state $q \in Q$ and each input symbol $a \in \Sigma$.
The IDS algorithms and the entire evaluation framework were
implemented in Java. The performance of the input string and
DFA generators depends on Java's Random class, which
generates pseudorandom numbers from a specific
seed. To minimize the chance of generating the same
pseudorandom strings or automata again, the seed was set to the
system clock wherever possible.
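
The generators described above might be sketched as follows. This is an illustration under stated assumptions, not the authors' code: symbols are encoded as integers 0..alphabetSize-1, the class name RandomDfa and its members are hypothetical, and the strict bounds on |F| and on the string length follow the inequalities as printed in the text.

```java
import java.util.Random;

/** Hypothetical sketch of a random complete DFA over symbols 0..alphabetSize-1. */
final class RandomDfa {
    final int initial;         // initial state, chosen at random
    final boolean[] accepting; // accepting[q] is true iff q is a final state
    final int[][] delta;       // delta[q][a] is the successor of q on symbol a

    RandomDfa(int numStates, int alphabetSize, Random rng) {
        if (numStates < 2 || alphabetSize < 1) {
            throw new IllegalArgumentException("need at least 2 states and 1 symbol");
        }
        accepting = new boolean[numStates];
        int numFinal = rng.nextInt(numStates); // |F| in 0..|Q|-1, so |F| < |Q|
        for (int marked = 0; marked < numFinal; ) {
            int q = rng.nextInt(numStates);
            if (!accepting[q]) { // mark distinct states only
                accepting[q] = true;
                marked++;
            }
        }
        initial = rng.nextInt(numStates);
        delta = new int[numStates][alphabetSize];
        for (int q = 0; q < numStates; q++) {
            for (int a = 0; a < alphabetSize; a++) {
                delta[q][a] = rng.nextInt(numStates); // equiprobable successor
            }
        }
    }

    /** Runs the DFA on a word and reports whether it ends in a final state. */
    boolean accepts(int[] word) {
        int q = initial;
        for (int a : word) {
            q = delta[q][a];
        }
        return accepting[q];
    }

    /** Random non-empty word of length 1..numStates-1 (the null string is never generated). */
    static int[] randomWord(int numStates, int alphabetSize, Random rng) {
        if (numStates < 2 || alphabetSize < 1) {
            throw new IllegalArgumentException("need at least 2 states and 1 symbol");
        }
        int[] word = new int[1 + rng.nextInt(numStates - 1)];
        for (int i = 0; i < word.length; i++) {
            word[i] = rng.nextInt(alphabetSize);
        }
        return word;
    }

    public static void main(String[] args) {
        Random rng = new Random(System.currentTimeMillis()); // seed from the system clock
        RandomDfa dfa = new RandomDfa(5, 2, rng);
        int[] word = randomWord(5, 2, rng);
        System.out.println("length " + word.length + ", accepted: " + dfa.accepts(word));
    }
}
```

A fixed seed, for example new Random(42L), would make a run repeatable, which is useful for debugging even though the experiments themselves used clock-based seeds.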

The purpose of the equivalence checker was to terminate 
the learning procedure as soon as the hypothesis automaton 
sequence had successfully converged to the target automaton. 
There are several well known equivalence checking algorithms
described in the literature. Their runtime complexity ranges
from quadratic to nearly linear. We chose
an algorithm with nearly linear time performance described 
in fiTl . This was to minimise the overhead of equivalence 
checking in the overall computation time. 
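
As an illustration of a nearly linear equivalence check, the sketch below uses the well-known union-find approach usually attributed to Hopcroft and Karp. Whether this is the algorithm used in our framework is not implied here; the class name, the array encoding of automata and the assumption that both DFA are complete over the same alphabet are choices made only for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch: union-find based equivalence check for two complete DFA. */
final class DfaEquivalence {

    /** Finds the class representative, shortening paths as it goes (path halving). */
    private static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    /**
     * deltaX[q][a] is the successor of state q on symbol a; accX[q] marks final states.
     * States of B are stored at offset deltaA.length in the shared union-find array.
     */
    static boolean equivalent(int[][] deltaA, boolean[] accA, int initA,
                              int[][] deltaB, boolean[] accB, int initB) {
        int nA = deltaA.length;
        int alphabet = deltaA[0].length;
        int[] parent = new int[nA + deltaB.length];
        for (int i = 0; i < parent.length; i++) {
            parent[i] = i;
        }
        Deque<int[]> work = new ArrayDeque<>();
        parent[find(parent, initA)] = find(parent, nA + initB); // merge the initial states
        work.push(new int[] {initA, initB});
        while (!work.isEmpty()) {
            int[] pair = work.pop();
            int p = pair[0]; // state of A
            int q = pair[1]; // state of B
            if (accA[p] != accB[q]) {
                return false; // merged states disagree on acceptance
            }
            for (int a = 0; a < alphabet; a++) {
                int nextP = deltaA[p][a];
                int nextQ = deltaB[q][a];
                int rootP = find(parent, nextP);
                int rootQ = find(parent, nA + nextQ);
                if (rootP != rootQ) {
                    parent[rootP] = rootQ; // successors must also be equivalent
                    work.push(new int[] {nextP, nextQ});
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Two DFA over a one-symbol alphabet accepting words of even length.
        int[][] deltaA = {{1}, {0}};
        boolean[] accA = {true, false};
        int[][] deltaB = {{1}, {2}, {3}, {0}};
        boolean[] accB = {true, false, true, false};
        System.out.println(equivalent(deltaA, accA, 0, deltaB, accB, 0));
    }
}
```

Each successful merge reduces the number of classes by one, so at most |Q_A| + |Q_B| - 1 pairs are ever placed on the work list. This sketch omits union by rank for brevity; adding it gives the usual almost-linear bound.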

Ten different automata were generated randomly for each
state space size (ranging between 5 and 50). They were learned
by both prefix free and prefix closed IDS, and their learning
times (in milliseconds) and the numbers of book-keeping and
membership queries asked to reach the target were recorded.
The graphs in Figure [4] and Figure [5] show the mean over ten
experiments of all these values for both variants of IDS.

A. Results and Interpretation 

The graphs in Figures [4] and [5] illustrate the outcome of our
experiments to measure the average time and average query
complexity of both IDS algorithms, as described in Section
IV.

Figure [4] presents the results of estimating the average
learning time for the prefix free and prefix closed IDS algorithms as
a function of the state space size of the target DFA. For large
state space sizes $|Q|$, the data sets of randomly generated target
DFA represent only a small fraction of all possible such DFA
of size $|Q|$. Therefore the two data curves are not smooth for
large state space sizes. Nevertheless, there is sufficient data to
identify some clear trends. The average learning time for prefix
free IDS learning is substantially greater than the corresponding
time for prefix closed IDS, and this discrepancy increases with
state space size. The reason would appear to be that prefix
free IDS throws away data about the target DFA that must be
regenerated randomly (since input string queries are generated
at random). The average time complexity for prefix free IDS
learning seems to grow approximately quadratically, while the
average time complexity for prefix closed IDS learning appears
to grow almost linearly within the given data range. From 
this viewpoint, prefix-closed IDS appears to be the superior 
algorithm. 

Figure [5] presents the results of estimating the average
number of membership queries and book-keeping queries as a
function of the state space size of the target DFA. Again, we
have compared prefix closed with prefix free IDS learning,
and again allowance must be made for the small data set sizes
for large state space values. We can see that membership
queries grow approximately linearly with the increase in state
space size, while book-keeping queries grow approximately
quadratically, at least within the data ranges that we
considered. There appears to be a small but significant decrease in
the number of both book-keeping and membership queries
used by the prefix closed IDS algorithm. The reason for this
appears to be similar to the issues identified for average time
complexity. Prefix closure seems to be an efficient way to
gather data about the target DFA. From the viewpoint of the
software engineering applications discussed in Section I, however,
prefix free IDS now appears to be preferable. This is because
the decreasing ratio of book-keeping to membership queries
makes it easier to direct the learning process using
externally generated queries (e.g. from a model checker).

V. Conclusions 

We have presented two versions of the IDS algorithm, an
incremental algorithm for learning DFA in polynomial
time. We have given a rigorous proof that both algorithms
correctly learn in the limit. Finally, we have presented the results
of an empirical study of the average time and query complexity
of IDS. These empirical results suggest that the IDS algorithm is
well suited to applications in software engineering, where an
incremental approach that allows externally generated online
queries is needed. This conclusion is further supported in [6],
where we have evaluated the IDS algorithm for learning based
testing of reactive systems, and shown that it leads to error
discovery up to 4000 times faster than using non-incremental
learning.

We gratefully acknowledge financial support for this
research from the Higher Education Commission (HEC) of
Pakistan, the Swedish Research Council (VR) and the EU
under project HATS FP7-231620.